code interpreter
Enhancing LLM Code Generation Capabilities through Test-Driven Development and Code Interpreter
Jalil, Sajed, Saha, Shuvo, Seym, Hossain Mohammad
Over the past few years, improving LLM code generation capabilities has been a key focus in NLP research. Despite Bengali having 242 million native speakers worldwide, it receives little attention when it comes to training LLMs. More recently, various fine-tuning and augmented generation techniques have been employed to significantly enhance code generation performance. However, they require considerable expertise and resources for an end user to utilize effectively. The goal of our work is to democratize access to powerful code generation tools in resource-constrained emerging markets, enabling users to leverage them in their native language. We introduce a novel approach that combines Test-Driven Development (TDD) and a Code Interpreter (CI), utilizing open-weight models, which improves the baseline accuracy for code generation with Bengali prompts and achieves an overall accuracy of 85%. Our approach requires no fine-tuning and shows that even the smallest models in a family can attain up to 98% of the accuracy of the largest models. All of our results are publicly shared on GitHub for validation and reproducibility.
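A minimal sketch of how a TDD-plus-Code-Interpreter loop of this kind could be wired together is shown below; the `llm` callable, prompt wording, and retry budget are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: test-driven code generation with an execution loop.
# `llm` stands in for any open-weight chat model; here it is a stub.
from typing import Callable

def tdd_codegen(task: str, tests: str, llm: Callable[[str], str], max_turns: int = 3) -> str:
    """Ask the model for code, run the provided tests, and feed failures back."""
    prompt = f"Task (may be written in Bengali):\n{task}\n\nUnit tests:\n{tests}\n\nWrite Python code that passes the tests."
    code = ""
    for _ in range(max_turns):
        code = llm(prompt)
        namespace: dict = {}
        try:
            exec(code, namespace)   # "code interpreter" step; use a sandbox in practice
            exec(tests, namespace)  # tests raise AssertionError on failure
            return code             # all tests passed
        except Exception as err:
            prompt += f"\n\nYour last attempt failed with: {err!r}. Fix the code."
    return code  # best effort after the retry budget is spent

if __name__ == "__main__":
    # Toy stub model that always returns a correct solution.
    stub = lambda _prompt: "def add(a, b):\n    return a + b"
    print(tdd_codegen("Add two numbers.", "assert add(2, 3) == 5", stub))
```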
- Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.05)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Automatic Programming (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.32)
TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture
Chen, Yongchao, Chen, Jiefeng, Meng, Rui, Yin, Ji, Li, Na, Fan, Chuchu, Wang, Chi, Pfister, Tomas, Yoon, Jinsung
While integrating tools like Code Interpreter and Search has significantly enhanced Large Language Model (LLM) reasoning in models like ChatGPT Agent and Gemini-Pro, practical guidance on optimal tool use is lacking. The core challenge is effectively combining textual reasoning, coding, and search for diverse questions. In this paper, we propose Tool-Use Mixture (TUMIX), an ensemble framework that runs multiple agents in parallel, each employing distinct tool-use strategies and answer paths. Agents in TUMIX iteratively share and refine responses based on the question and previous answers. In experiments, TUMIX achieves significant gains over state-of-the-art tool-augmented and test-time scaling methods, delivering an average accuracy improvement of up to 3.55% over the best baseline on Gemini-2.5-Pro and Gemini-2.5-Flash across key reasoning benchmarks, with near-equal inference costs. We find that agent diversity and quality are crucial and can be enhanced by using LLMs to auto-optimize agent designs. Furthermore, TUMIX can halt refinement upon reaching sufficient confidence, preserving performance at only 49% of the inference cost. Further scaling can achieve higher performance, albeit at a greater cost.
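A rough sketch of the parallel-agents-with-refinement idea described above follows; the stub agent strategies, the majority-agreement stopping rule, and the round budget are assumptions for illustration, not TUMIX's actual components.

```python
# Hedged sketch of a TUMIX-style tool-use mixture: several agents answer in
# parallel, then refine while seeing each other's previous answers, and the
# loop halts early once the committee mostly agrees. All strategies are stubs.
from collections import Counter
from typing import Callable, List

Agent = Callable[[str, List[str]], str]  # (question, peers' previous answers) -> answer

def tumix_round(question: str, agents: List[Agent], prev: List[str]) -> List[str]:
    return [agent(question, prev) for agent in agents]

def tumix(question: str, agents: List[Agent], max_rounds: int = 3, agree: float = 0.8) -> str:
    answers: List[str] = []
    for _ in range(max_rounds):
        answers = tumix_round(question, agents, answers)
        top, count = Counter(answers).most_common(1)[0]
        if count / len(answers) >= agree:  # confident enough: stop refining to save cost
            return top
    return Counter(answers).most_common(1)[0][0]  # majority vote after the budget

if __name__ == "__main__":
    # Three toy "agents" with different strategies (text-only, code, search), all stubbed.
    text_agent   = lambda q, prev: "42"
    code_agent   = lambda q, prev: str(eval("40 + 2"))
    search_agent = lambda q, prev: prev[0] if prev else "41"  # defers to peers after round 1
    print(tumix("What is 40 + 2?", [text_agent, code_agent, search_agent]))
```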
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
R1-Code-Interpreter: LLMs Reason with Code via Supervised and Multi-stage Reinforcement Learning
Chen, Yongchao, Liu, Yueying, Zhou, Junwei, Hao, Yilun, Wang, Jingquan, Zhang, Yang, Li, Na, Fan, Chuchu
Practical guidance on training Large Language Models (LLMs) to leverage a Code Interpreter across diverse tasks remains lacking. While reinforcement learning (RL)-based fine-tuning has significantly improved LLMs' textual reasoning and planning, purely textual reasoning remains unreliable for precise computation; in contrast, symbolic code generation handles such steps rigorously and benefits from external tools. A key challenge is guiding LLMs to decide when to rely on textual reasoning versus programmatic solutions, given that most input questions lack explicit cues about which approach is best and the possible text/code solution space is large. OpenAI's GPT models address this by incorporating a Code Interpreter that allows iterative code generation and execution, but existing open-source Code Interpreter implementations struggle to effectively steer between text and code, underutilizing symbolic capabilities. Recent work such as ToRL (Li et al., 2025b) and ReTool (Feng et al., 2025) investigates training reasoning models to integrate with Code Interpreters. To tackle these challenges, we present R1-Code-Interpreter, a framework for integrating a Code Interpreter into open-source LLMs. We curate 144 reasoning and planning tasks and synthesize 6.5k multi-turn text/code trajectories for supervised fine-tuning; training is difficult because of task heterogeneity and the scarcity of effective samples. Our final model, R1-CI-14B, improves average accuracy on the 37 test tasks from 44.1% to 72.4%. Notably, R1-CI-14B also exhibits emergent self-checking behavior through code generation.
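To make the multi-turn text/code format concrete, here is a minimal sketch of such a rollout loop in which fenced code blocks are executed and their output is appended to the context; the fence convention, turn budget, and stub model are assumptions, not the paper's training setup.

```python
# Hedged sketch of a multi-turn text/code rollout: the model may emit fenced
# ```python blocks; each block is executed and its output is appended to the
# context before the next turn. The fence and stop conventions are assumptions.
import io, re, contextlib
from typing import Callable

CODE_RE = re.compile(r"```python\n(.*?)```", re.DOTALL)

def run_block(code: str, state: dict) -> str:
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, state)            # persistent state across turns; sandbox in practice
    return buf.getvalue()

def rollout(question: str, llm: Callable[[str], str], max_turns: int = 4) -> str:
    context, state = question, {}
    for _ in range(max_turns):
        turn = llm(context)
        context += "\n" + turn
        blocks = CODE_RE.findall(turn)
        if not blocks:               # no code requested: treat the turn as the final answer
            return turn
        for code in blocks:
            context += f"\n[interpreter output]\n{run_block(code, state)}"
    return context

if __name__ == "__main__":
    # Stub model: first turn asks for code, second turn answers from its output.
    turns = iter(["Let me compute.\n```python\nprint(6 * 7)\n```", "The answer is 42."])
    print(rollout("What is 6 * 7?", lambda ctx: next(turns)))
```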
- Asia > Vietnam > Hanoi > Hanoi (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- (2 more...)
- Instructional Material (0.66)
- Research Report (0.50)
When GPT Spills the Tea: Comprehensive Assessment of Knowledge File Leakage in GPTs
Shen, Xinyue, Shen, Yun, Backes, Michael, Zhang, Yang
Knowledge files have been widely used in large language model (LLM) agents, such as GPTs, to improve response quality. However, concerns about the potential leakage of knowledge files have grown significantly. Existing studies demonstrate that adversarial prompts can induce GPTs to leak knowledge file content. Yet, it remains uncertain whether additional leakage vectors exist, particularly given the complex data flows across clients, servers, and databases in GPTs. In this paper, we present a comprehensive risk assessment of knowledge file leakage, leveraging a novel workflow inspired by Data Security Posture Management (DSPM). Through an analysis of metadata from 651,022 GPTs, 11,820 flows, and 1,466 responses, we identify five leakage vectors: metadata, GPT initialization, retrieval, sandboxed execution environments, and prompts. These vectors enable adversaries to extract sensitive knowledge file data such as titles, content, types, and sizes. Notably, the activation of the built-in tool Code Interpreter leads to a privilege escalation vulnerability, enabling adversaries to directly download original knowledge files with a 95.95% success rate. Further analysis reveals that 28.80% of leaked files are copyrighted, including digital copies from major publishers and internal materials from a listed company. Finally, we provide actionable solutions for GPT builders and platform providers to secure the GPT data supply chain.
Towards Effective Code-Integrated Reasoning
Bai, Fei, Min, Yingqian, Zhang, Beichen, Chen, Zhipeng, Zhao, Wayne Xin, Fang, Lei, Liu, Zheng, Wang, Zhongyuan, Wen, Ji-Rong
In this paper, we investigate code-integrated reasoning, where models generate code when necessary and integrate feedback by executing it through a code interpreter. To acquire this capability, models must learn when and how to use external code tools effectively, which is supported by tool-augmented reinforcement learning (RL) through interactive learning. Despite its benefits, tool-augmented RL can still suffer from potential instability in the learning dynamics. In light of this challenge, we present a systematic approach to improving the training effectiveness and stability of tool-augmented RL for code-integrated reasoning. Specifically, we develop enhanced training strategies that balance exploration and stability, progressively building tool-use capabilities while improving reasoning performance. Through extensive experiments on five mainstream mathematical reasoning benchmarks, our model demonstrates significant performance improvements over multiple competitive baselines. Furthermore, we conduct an in-depth analysis of the mechanism and effect of code-integrated reasoning, revealing several key insights, such as the extension of the model's capability boundaries and the simultaneous improvement of reasoning efficiency through code integration. All data and code for reproducing this work are available at: https://github.com/RUCAIBox/CIR.
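A minimal sketch of the kind of outcome-based reward one might use when applying tool-augmented RL to code-integrated reasoning; the specific shaping (answer correctness plus a small penalty for failed executions) is an illustrative assumption, not the paper's recipe.

```python
# Hedged sketch of an outcome reward for code-integrated reasoning rollouts.
# A trajectory is assumed to record each code block's execution status and a
# final answer string; the weighting below is illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Trajectory:
    final_answer: str
    exec_ok: List[bool] = field(default_factory=list)  # one flag per executed code block

def outcome_reward(traj: Trajectory, gold: str, fail_penalty: float = 0.1) -> float:
    correct = 1.0 if traj.final_answer.strip() == gold.strip() else 0.0
    failures = sum(1 for ok in traj.exec_ok if not ok)
    # Correctness dominates; failed tool calls shave a little off to keep rollouts stable.
    return correct - fail_penalty * failures

if __name__ == "__main__":
    good = Trajectory(final_answer="42", exec_ok=[True, True])
    flaky = Trajectory(final_answer="42", exec_ok=[False, True])
    print(outcome_reward(good, "42"), outcome_reward(flaky, "42"))  # 1.0 0.9
```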
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
- Information Technology > Artificial Intelligence > Natural Language (0.97)
OccamLLM: Fast and Exact Language Model Arithmetic in a Single Step
Language model systems often enable LLMs to generate code for arithmetic operations to achieve accurate calculations. However, this approach compromises speed and security, and fine-tuning risks the language model losing prior capabilities. We propose a framework that enables exact arithmetic in *a single autoregressive step*, providing faster, more secure, and more interpretable LLM systems with arithmetic capabilities. We use the hidden states of an LLM to control a symbolic architecture that performs arithmetic. Furthermore, OccamLlama outperforms GPT-4o with and without a code interpreter on average across a range of mathematical problem-solving benchmarks, demonstrating that OccamLLMs can excel in arithmetic tasks, even surpassing much larger models.
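A rough sketch of the controller idea: a small head reads the LLM's hidden state and selects a symbolic arithmetic operation, which is evaluated exactly in one step. The hidden size, the operation set, the linear head, and the PyTorch dependency are assumptions for illustration, not the OccamLLM architecture; operand extraction is stubbed.

```python
# Hedged sketch: a hidden-state-driven controller that picks a symbolic
# arithmetic op in a single step instead of generating digits token by token.
import operator
import torch
import torch.nn as nn

OPS = [operator.add, operator.sub, operator.mul, operator.truediv]

class ArithmeticController(nn.Module):
    def __init__(self, hidden_size: int = 4096):
        super().__init__()
        self.op_head = nn.Linear(hidden_size, len(OPS))  # scores one symbolic op per step

    def forward(self, hidden_state: torch.Tensor, a: float, b: float) -> float:
        op_idx = int(self.op_head(hidden_state).argmax(dim=-1))
        return OPS[op_idx](a, b)  # exact arithmetic performed symbolically

if __name__ == "__main__":
    controller = ArithmeticController()
    h = torch.randn(4096)                # stand-in for the LLM's last hidden state
    print(controller(h, 127.0, 918.0))   # exact result of whichever op the head selects
```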
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
Feng, Jiazhan, Huang, Shijue, Qu, Xingwei, Zhang, Ge, Qin, Yujia, Zhong, Baoquan, Jiang, Chengquan, Chi, Jinxin, Zhong, Wanjun
While reasoning models (e.g., DeepSeek R1) trained with reinforcement learning (RL) excel in textual reasoning, they struggle in scenarios requiring structured problem-solving, such as geometric reasoning, precise computation, or complex equation solving, areas where computational tools like code interpreters (CI) demonstrate distinct advantages. To bridge this gap, we propose ReTool, which enhances long-form reasoning with tool-integrated learning, including two key features: (1) dynamic interleaving of real-time code execution within natural language reasoning processes, and (2) an automated RL paradigm that allows policy rollouts with multi-turn real-time code execution and teaches the model when and how to invoke tools based on outcome feedback. ReTool employs a systematic training framework, beginning with synthetic cold-start data generation to produce code-augmented long-form reasoning traces for fine-tuning base models. Subsequent RL training leverages task outcomes as rewards to iteratively refine the model's tool-use strategy, enabling autonomous discovery of optimal tool invocation patterns without human priors. Experiments on the challenging math Olympiad benchmark AIME demonstrate ReTool's superiority: our 32B model achieves 67% accuracy with 400 training steps, outperforming the text-based RL baseline (40% accuracy, 1080 steps) in both efficiency and performance. Remarkably, ReTool-32B attains 72.5% accuracy in extended settings, surpassing OpenAI's o1-preview by 27.9%. Further analysis reveals emergent behaviors such as code self-correction, signaling an "aha moment" in which the model autonomously masters adaptive tool use. These findings highlight the promise of outcome-driven tool integration for advancing complex mathematical reasoning and offer new insights into hybrid neuro-symbolic systems.
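To illustrate what a code-augmented long-form reasoning trace for cold-start fine-tuning might look like, here is a minimal sketch that interleaves a fenced code block with its captured execution output; the fence and "[output]" conventions, and the example question, are assumptions rather than ReTool's exact data format.

```python
# Hedged sketch: assembling one code-augmented reasoning trace for cold-start
# SFT, interleaving a fenced code block with its captured execution output.
import io, contextlib

def execute(code: str) -> str:
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})               # sandbox this in practice
    return buf.getvalue().strip()

def build_trace(question: str, thought: str, code: str, conclusion: str) -> str:
    out = execute(code)
    return (
        f"Question: {question}\n"
        f"{thought}\n"
        f"```python\n{code}\n```\n"
        f"[output]\n{out}\n"
        f"{conclusion.format(out=out)}"
    )

if __name__ == "__main__":
    trace = build_trace(
        question="How many positive divisors does 360 have?",
        thought="Counting divisors by hand is error-prone, so I'll enumerate them.",
        code="print(sum(1 for d in range(1, 361) if 360 % d == 0))",
        conclusion="So 360 has {out} positive divisors.",
    )
    print(trace)
```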
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Monaco (0.04)
- (2 more...)
Testing GPT-4-o1-preview on math and science problems: A follow-up study
In August 2023, Scott Aaronson and I reported the results of testing GPT-4 with the Wolfram Alpha and Code Interpreter plug-ins on a collection of 105 original high-school-level and college-level science and math problems (Davis and Aaronson, 2023). On September 12, 2024, OpenAI released two preliminary versions, "ChatGPT-o1-preview" and "ChatGPT-o1-mini", of a forthcoming product, "ChatGPT-o1" (OpenAI, 2024). In September 2024, I tested the recently released GPT-4-o1-preview on the same collection. Overall I found that performance had significantly improved, but was still considerably short of perfect. In particular, problems that involve spatial reasoning are often stumbling blocks.
- Europe > France (0.05)
- North America > United States > California > San Francisco County > San Francisco (0.05)
- North America > Canada > Quebec (0.05)
- (10 more...)
- Education > Educational Setting (0.55)
- Government > Space Agency (0.47)
PyBench: Evaluating LLM Agent on various real-world coding tasks
Zhang, Yaolun, Pan, Yinxu, Wang, Yudong, Cai, Jie
The LLM Agent, equipped with a code interpreter, is capable of automatically solving real-world coding tasks, such as data analysis and image editing. However, existing benchmarks primarily focus either on simplistic tasks, such as completing a few lines of code, or on extremely complex and specific tasks at the repository level, neither of which is representative of everyday coding tasks. To address this gap, we introduce PyBench, a benchmark encompassing five main categories of real-world tasks and covering more than 10 types of files. Given a high-level user query and related files, the LLM Agent needs to reason and execute Python code via a code interpreter for a few turns before making a formal response that fulfills the user's requirements. Successfully addressing tasks in PyBench demands a robust understanding of various Python packages, superior reasoning capabilities, and the ability to incorporate feedback from executed code. Our evaluations indicate that current open-source LLMs struggle with these tasks. Hence, we conduct analyses and experiments on four kinds of datasets, showing that comprehensive abilities are needed for PyBench. Our fine-tuned 8B model, PyLlama3, achieves strong performance on PyBench, surpassing many 33B and 70B models. Our benchmark, training dataset, and model are available at: https://github.com/Mercury7353/PyBench
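A minimal sketch of a PyBench-style interaction loop: given a high-level query and an attached file, the agent emits Python, a code interpreter runs it, and the output is fed back until the agent produces a formal response. The prompt wording, fence convention, turn budget, and stub agent are assumptions, not the benchmark's harness.

```python
# Hedged sketch of a file-based agent loop with a code interpreter.
import io, re, contextlib, tempfile, csv
from typing import Callable

CODE_RE = re.compile(r"```python\n(.*?)```", re.DOTALL)

def interpreter(code: str) -> str:
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})               # run in a proper sandbox for untrusted models
    return buf.getvalue()

def solve(query: str, file_path: str, agent: Callable[[str], str], max_turns: int = 3) -> str:
    context = f"User query: {query}\nAttached file: {file_path}"
    for _ in range(max_turns):
        turn = agent(context)
        match = CODE_RE.search(turn)
        if match is None:            # no code: the agent is giving its formal response
            return turn
        context += f"\n[interpreter]\n{interpreter(match.group(1))}"
    return context

if __name__ == "__main__":
    # Toy CSV file plus a stub agent that first inspects the file, then answers.
    with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as f:
        csv.writer(f).writerows([["x"], [1], [2], [3]])
    turns = iter([
        f"```python\nimport csv\nrows = list(csv.reader(open({f.name!r})))\nprint(sum(int(r[0]) for r in rows[1:]))\n```",
        "The x column sums to 6.",
    ])
    print(solve("Sum the x column.", f.name, lambda ctx: next(turns)))
```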